Dataset sourced from: https://github.com/jasonshi10/art_auction_valuation?tab=readme-ov-file
37,638 Unique Rows
23 Columns (Price, Material, Height, etc.)
Artist- Artist Name
Country - Country artist is from
YearofBirth - Artist’s birth year
YearOfDeath - Artist’s death year
Name - Name of the artwork
Year - Year artwork was created
Material - Materials used for the Artwork
Height - Height of artwork in inches
Width- Width of artwork in inches
Link - Link to an image of artwork
Source - Where the data was originally scraped
from
DominantColor - The dominant color in an
artwork
Brightness - Mean brightness of an artwork. A value
closer to 0 denotes a dark image and that closer to 255 indicates a
bright one
RatioUniqueColors - The number of unique colors in
an image as a ration of the total number of pixels
thresholdBlackPerc - If pixel value is greater than
a threshold value(here we use 127, range from 0-255), it is assigned one
value (255,white), else it is assigned another value (0,black). Then
calculate the percentage of white or black in the image, and get the
ratio of black pixels in the greyscale of paintings
HighbrightnessPerc - Calculate the average
brightness of each paintings and how many pixels have two times of the
average brightness, then get ratio of these two numbers.
LowbrightnessPerc - Calculate the average brightness
of each paintings and count how many pixels have less than half of the
mean brightness of that image, then get ratio of these two
numbers.
CornerPerc - Use Harris Corner Detection algorithm
to detect the corner in the artworks. Corner is the intersection of two
edges, it represents a point in which the directions of these two edges
change. Hence, the gradient of the image (in both directions) have a
high variation, which can be used to detect it. With that, we can
calculate the ratio of pixels as corners in the full image.
EdgePer - Use Canny Edge Detection algorithm to
detect the edges in the image. And then calculate the percentage of
pixels recognized as edges in the whole picture.
FaceCount - Number of faces in an artwork’s
images
Sold Time - When the auction sales
happened.
Price - Amount artwork sold for in US Dollars ($)
The features that may be helpful in producing a model are
material, height, width,
dominantColor, brightness,
ratioUniqueColors, thresholdBlackPerc,
highbrightnessPerc, lowbrightnessPerc,
CornerPer, EdgePer, and
FaceCount.
library(ggplot2)
library(naniar) # Load nanair for missing data visualization
library(OneR)
library(tidyverse)
── Attaching core tidyverse packages ────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ forcats 1.0.0 ✔ readr 2.1.5
✔ lubridate 1.9.3 ✔ stringr 1.5.1
✔ purrr 1.0.2 ✔ tidyr 1.3.1
── Conflicts ──────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::accumulate() masks foreach::accumulate()
✖ randomForest::combine() masks dplyr::combine()
✖ neuralnet::compute() masks dplyr::compute()
✖ mice::filter() masks dplyr::filter(), stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::lift() masks caret::lift()
✖ randomForest::margin() masks ggplot2::margin()
✖ xgboost::slice() masks dplyr::slice()
✖ purrr::when() masks foreach::when()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
library(tidytext)
library(dplyr)
# Read in the data
data <- read.delim("~/Desktop/MachineLearning/Final Project/data.txt")
# Take out data that's not needed (X, yearOfBirth, yearOfDeath, soldTime)
art_data <- data[, c(2:3, 6:11, 14:22)]
# Removed a row due to data being inaccurate.
art_data <- art_data[-34541, ]
# Look at the structure of art_data
str(art_data)
'data.frame': 41252 obs. of 17 variables:
$ artist : chr "Mario A" "Mario A" "A E Cremer" "A G Schultz & Co." ...
$ country : chr "Swiss" "Swiss" "French" "American" ...
$ name : chr "The world is beautyful #5" "The world is beautyful #13" "Spot Lights" "Sugar/Sweetmeat Baskets" ...
$ year : chr "2004" "2004" "" "" ...
$ price : num 5315 7383 2090 615 8125 ...
$ material : chr "laserchrome_print_diasec" "laserchrom_print_(diasec.)" "black-painted_metal" "Sterling_Silver" ...
$ height : chr "29.53" "29.53" "" "6.5" ...
$ width : chr "39.37" "39.37" "" "5.75" ...
$ dominantColor : chr "yellows" "blacks" "whites" "blacks" ...
$ brightness : num 98 73 212 73 216 45 188 94 204 222 ...
$ ratioUniqueColors : num 0.25 0.19 0.05 0.18 0.02 0.19 0 0 0.12 0.16 ...
$ thresholdBlackPerc: num 65.7 66.2 20.2 83 11.3 ...
$ highbrightnessPerc: num 0.75 15.21 0 13.74 0 ...
$ lowbrightnessPerc : num 21.56 46.5 17.56 35.74 6.27 ...
$ CornerPer : num 0.37 1.52 0.39 3.58 1.25 0.34 1.14 0.59 1.08 1.18 ...
$ EdgePer : num 4.02 7.28 4.15 13.13 12.95 ...
$ FaceCount : num 1 0 0 0 0 0 0 0 0 0 ...
summary(art_data$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20 1638 6605 241529 23750 119922500
Some rows in our data are empty but not set to N/A. Need to convert those empty values to N/A
art_data <- as.data.frame(lapply(art_data, function(x) {
ifelse(x == "", NA, x)
}))
Visualize our response variable, Price Due to the large range of values in Price, we decided to take the natural log of price \(log(price + 1)\). This will help better visualize price.
summary(art_data$price)
Min. 1st Qu. Median Mean 3rd Qu. Max.
20 1638 6605 241529 23750 119922500
Taking the log(Price + 1), this helped reduce the skew of
Price. Making it easier to visualize
Look at the missing values in our data…
# Visualize missing features
feat_vars <- names(art_data)[c(4, 6:17)]
vis_miss(art_data[, feat_vars])
# Visualize missing features with our response variable price
t_bins <- bin(art_data$log_price, nbins = 6, method = "length") # Bin response variable
plot_dat <- cbind.data.frame(t_bins, art_data[, feat_vars])
gg_miss_fct(x = plot_dat, fct = t_bins) +
labs(x = "Price")
Year, width, and height are the only features that have
missing data,with Year having the most missing values. Year having 32%
missing, making it a feature we won’t use for building our
model.
Find the most common materials used in our
dataframe.
art_data$material <- str_replace_all(art_data$material, '_', ' ')
# Figure out the most common words/phrase used in materials column using the tidyverse
word_count <- art_data %>%
unnest_tokens(word, material) %>%
anti_join(stop_words, by = "word") %>%
count(word, sort = TRUE)
# Plot top 10 words/phrases
word_count %>%
slice_max(n, n = 10) %>%
ggplot(aes(x = reorder(word, n), y = n)) +
geom_col() +
coord_flip() +
labs(x = "Word", y = "Count", title = "Most Common Words")
Oil is the most used material in our dataset
# Pull out major material categories like oil
oil <- rep(0, nrow(art_data))
oil[grep("oil", art_data$material)] <- 1
sum(oil[grep("oil", art_data$material)])
[1] 13511
# 13511 Artworks use Oil as a material
# Create Factor column of oil
art_data$oil <- as.factor(oil)
# Visualize Oil and Price
g_2 <- ggplot(art_data, aes( y = log_price, x = oil, fill = oil)) + # Set x and fill as disagnosis, y as value
geom_boxplot() + # Use boxlot
theme_bw() + # Set theme
theme(panel.grid.major = element_blank(), # Remove grid
panel.grid.minor = element_blank(), # Remove grid
panel.border = element_blank(), # Remove grid
panel.background = element_blank()) + # Remove grid
labs(x = "Oil", title = "Oil vs Price",
fill = "Oil") + # Set labels
scale_fill_manual(values = c("1" = "red", "0" = "blue"), # Manually set fill values
labels = c("1" = "Oil", "0" = "Other Material Used"))
g_2
# Pull out acrylic in material and visualize
acrylic <- rep(0, nrow(art_data))
acrylic[grep("acrylic", art_data$material)] <- 1
art_data$acrylic <- as.factor(acrylic)
# Visualize Acrylic and Price
g_3 <- ggplot(art_data, aes( y = log_price, x = acrylic, fill = acrylic)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(x = "Acrylic", title = "Acrylic vs Price",
fill = "Acrylic") + # Set labels
scale_fill_manual(values = c("1" = "red", "0" = "blue"),
labels = c("1" = "Acrylic", "0" = "Other Material Used"))
g_3
# Watercolor
# Pull out watercolor in material and visualize
watercolor <- rep(0, nrow(art_data))
watercolor[grep("watercolor", art_data$material)] <- 1
art_data$watercolor <- as.factor(watercolor)
# Visualize Acrylic and Price
g_4 <- ggplot(art_data, aes( y = log_price, x = watercolor, fill = watercolor)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(x = "watercolor", title = "watercolor vs Price",
fill = "watercolor") + # Set labels
scale_fill_manual(values = c("1" = "red", "0" = "blue"),
labels = c("1" = "watercolor", "0" = "Other Material Used"))
g_4
# Pull out Screenprint in material and visualize
screenprint <- rep(0, nrow(art_data))
screenprint[grep("screenprint", art_data$material)] <- 1
art_data$screenprint <- as.factor(screenprint)
# Visualize Screenprint and Price
g_5 <- ggplot(art_data, aes( y = log_price, x = screenprint, fill = screenprint)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(x = "screenprint", title = "screenprint vs Price",
fill = "screenprint") + # Set labels
scale_fill_manual(values = c("1" = "red", "0" = "blue"),
labels = c("1" = "screenprint", "0" = "Other Material Used"))
g_5
Through these visualizations, we can see that the
Material’s used to create an artwork does have an impact on our response
variable price.
Brightness and Log_Price
g_7 <- ggplot(art_data,
aes(y = log_price,
x = brightness)) +
geom_point(color = "blue", alpha = 0.10) +
geom_smooth(method = 'lm') +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Price", # Set plot labels
x = "Brightness",
title = "Brightness of Artwork vs Log_Price")
g_7
`geom_smooth()` using formula = 'y ~ x'
The Brightness of an artwork seems to have a somewhat
positive relationship with the log_price
Face Count
vs Price
art_data$FaceCount <- as.factor(art_data$FaceCount)
g_8 <- ggplot(art_data, aes( y = log_price, x = FaceCount, fill = FaceCount)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g_8
The face count of an artwork seems to have a relationship
with log_price, but there are many outliers with log_price and
FaceCount
Dominant Color vs Price
art_data$dominantColor <- as.factor(art_data$dominantColor)
g_9 <- ggplot(art_data, aes( y = log_price, x = dominantColor, fill = dominantColor)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g_9
Dominant Color seems to play a role in the price of an artwork, but there are many outliers. Ratio of Unique Colors vs Price
g_10 <- ggplot(art_data,
aes(y = log_price,
x = ratioUniqueColors)) +
geom_point(color = "blue", alpha = 0.10) +
geom_smooth(method = 'lm') +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Price", # Set plot labels
x = "Ratio Unique Colors",
title = "Ratio of Unique Colors vs Log_Price")
g_10
`geom_smooth()` using formula = 'y ~ x'
There seems to be a somewhat negative relationship with the
Ratio of Unique Colors and the Price of an Artwork. The less colors used
in an artwork or the higher ratio of unique colors, the price of an
artwork decreases
Visualize Height/Width with Price
art_data$width <- as.numeric(art_data$width)
Warning: NAs introduced by coercion
art_data$height <- as.numeric(art_data$height)
Warning: NAs introduced by coercion
# Decided to take the log of width & height of an artwork to reduce the skew...
g_12 <- ggplot(art_data,
aes(y = log_price,
x = log(width))) +
geom_point(color = "blue", alpha = 0.10) +
geom_smooth(method = 'lm') +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Log_Price", # Set plot labels
x = "log(Width) of Artwork",
title = "Log_Width vs Log_Price")
g_12
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2512 rows containing non-finite outside the scale range (`stat_smooth()`).
Warning: Removed 2512 rows containing missing values or values outside the scale range
(`geom_point()`).
g_13 <- ggplot(art_data,
aes(y = log_price,
x = log(height))) +
geom_point(color = "blue", alpha = 0.10) +
geom_smooth(method = 'lm') +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Log_Price", # Set plot labels
x = "log(Height) of Artwork",
title = "Log_Height vs Log_Price")
g_13
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2512 rows containing non-finite outside the scale range (`stat_smooth()`).
Warning: Removed 2512 rows containing missing values or values outside the scale range
(`geom_point()`).
g_14 <- ggplot(art_data,
aes(y = log_price,
x = log(height*width))) +
geom_point(color = "blue", alpha = 0.10) +
geom_smooth(method = 'lm') +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank()) +
labs(y = "Log_Price", # Set plot labels
x = "Log_Area of Artwork",
title = "Log_Area vs Log_Price")
g_14
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 2512 rows containing non-finite outside the scale range (`stat_smooth()`).
Warning: Removed 2512 rows containing missing values or values outside the scale range
(`geom_point()`).
There seems to be a somewhat positive relationship with the
area of an artwork and its price
Create a new column
to identify a well known artist vs not well known artist based on the
average price of artworks.
There are 8,608 unique artists in the
data-set.
summary(art_data$well_known)
0 1
23277 17975
Visualize Artwork Price and if the Artist is considered well known or not.
art_data$well_known <- as.factor(art_data$well_known)
g_11 <- ggplot(art_data, aes( y = log_price, x = well_known, fill = well_known)) +
geom_boxplot() +
theme_bw() +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g_11
Well known artists have a higher average artwork price compared to lesser known artists.
save(art_data, artist_avg, plot_dat, word_count, data,
file = "final_project_data.RData")